{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第10章 映射、哈希表和跳跃表"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10.1 映射和字典"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "key唯一，value不唯一，不唯一是说不同key可能映射到同一个value上。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.1.1 Map的ADT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. M[k]: 存在则返回映射M中键k对应的value，如果不存在则报错。（在Python中定义特殊方法`__getitem__`来实现）\n",
    "2. M[k] = v: 存在则用v替换键k对应的value，不存在则新建键k映射到v。（`__setitem__`）\n",
    "3. del M[k]: 存在则删除键k对应的元组，不存在则报错。（`__delitem__`）\n",
    "4. len(M): 返回映射M中元组的数量。(`__len__`)\n",
    "5. iter(M): 生成包含所有`键`的迭代器。（`__iter__`） "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上述方法是主要的Map支持的操作，下面是一些其他的操作：\n",
    "\n",
    "1. k in M: M的键中是否有k。（__contains__）\n",
    "2. M.get(k, d=None): 如果M中存在键值k，则返回M[k]，否则返回d，d默认为None。（与M[k]相比避免了Error的出现）\n",
    "3. M.setdefault(k, d): 如果M中存在键值k，则返回M[k]，否则设置M[k]=d，并返回d。\n",
    "4. M.pop(k, d=None): 如果M中存在键值k，则删除键值为k的元组并返回value，否则返回d。（与del M[k]差不多，只是不会raise Error）\n",
    "5. M.popitem(): 随机删除一个元组并返回该元组，如果M为空，则raise Error。\n",
    "6. M.clear(): 删除所有元组。\n",
    "7. M.keys(): 返回一个类似集合的M的键的view。\n",
    "8. M.values(): 同上理。\n",
    "9. M.items(): 同上理。\n",
    "10. M.update(M2): 将M2中所有元组添加到M中。\n",
    "11. M == M2: 所有元组都相同，返回True。\n",
    "12. M != M2: M和M2有一个元组不同，返回True。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.1.3 Python的MutableMapping抽象基类"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "抽象基类提供了子类通用的方法的模板，但是具体的实现由子类来完成，不过抽象基类可能定义一些基于未定义的基本方法的扩展方法，子类可以直接继承而无需再定义，比如抽象基类定义了`__len__`而没有具体实现，但是`is_empty`可以根据`__len__`来实现，他们的逻辑关系可以在抽象基类中用来实现`is_empty`。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "抽象基类的意义在于提供一个框架，提供一些子类所需要定义和实现的基本方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "MutableMapping类为很多方法都提供了实现，基于五个基本的方法，但是这五个基本的方法需要在子类中实现。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.1.4 自己编写MapBase类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections.abc import MutableMapping\n",
    "\n",
    "class MapBase(MutableMapping):\n",
    "    \n",
    "    class _Item:\n",
    "        \n",
    "        __slots__ = '_key', '_value'\n",
    "        \n",
    "        def __init__(self, k, v):\n",
    "            self._key = k\n",
    "            self._value = v\n",
    "        \n",
    "        def __eq__(self, other):\n",
    "            return self._key == other._key\n",
    "        \n",
    "        def __ne__(self, other):\n",
    "            return not (self == other)\n",
    "        \n",
    "        def __lt__(self, other):\n",
    "            return self._key < other._key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.1.5 UnsortedTableMap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "class UnsortedTableMap(MapBase):\n",
    "\n",
    "    def __init__(self):\n",
    "        self._table = []                         ## tuple组成的list，还不具备散列表的特征\n",
    "\n",
    "    def __getitem__(self, k):                    ## 这里还是O(n)\n",
    "        for item in self._table:\n",
    "            if k == item._key:\n",
    "                return item._value\n",
    "        raise KeyError('Key Error: ' + repr(k))\n",
    "\n",
    "    def __setitem__(self, k, v):\n",
    "        for item in self._table:\n",
    "            if item._key == k:\n",
    "                item._value = v\n",
    "                return                           \n",
    "        self._table.append(self._Item(k, v))\n",
    "    \n",
    "    def __delitem__(self, k):\n",
    "        for i in range(len(self._table)):\n",
    "            if k == self._table[i]._key:\n",
    "                del self._table[i]\n",
    "                return \n",
    "        raise KeyError('Key Error: ' + repr(k))\n",
    "    \n",
    "    def __len__(self):\n",
    "        return len(self._table)\n",
    "    \n",
    "    def __iter__(self):\n",
    "        for item in self._table:\n",
    "            yield item._key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这种列表很简单，但是没有效率，访问需要O(n)的时间复杂度。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10.2 哈希表（散列表）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Python使用`hash table`实现`dict`。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过键访问`hash table`的效率类似于通过索引访问数组；数组支持索引的O(1)访问是因为数组可以通过索引结合数组第一个元素的内存地址在O(1)时间内获得相应索引存储的元素的内存地址，`hash table`也一样，但不是通过第一个元素获得内存地址，而是在一种特定的映射下，通过键直接计算出内存地址，键可以看作是值的索引。在这种情况下，`__getitem__`、`__setitem__`、`__delitem__`都能在O(1)时间内完成。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面通过`hash function`将键映射为内存地址，也可以理解为值存储在一个数组中，键映射为索引，再访问数组。在一些情况下，不同的键映射到相同的索引或者内存地址上，需要在索引或者内存地址处管理一个桶，这个桶存储多个键值对元组。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.2.1 哈希函数（散列函数）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "桶数组：数组每个元素都是一个桶。（这里每个桶存储多个元组）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "哈希函数：设桶数组长度为N，则哈希函数h将k映射为h(k)，范围是0~(N-1)，将键值对元组(k, v)存储在索引为h(k)的桶中。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "冲突（collision）：两个或者两个以上的键映射到同一个索引，即一个桶中存储了两个或者两个以上的键值对元组——>发生了冲突。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "好的哈希函数：最小化冲突的发生，即尽量分散地将键k映射到不同的索引上。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "哈希函数拆分：先将键转化为哈希码（整数），再通过压缩函数将哈希码转化为索引。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于压缩函数的存在，哈希码可以转化为合法的索引，这其实受限于桶数组的长度；一个很好的实现是为所有对象制定一个通用的转化为哈希码的规则，而压缩函数则根据桶数组的长度具体定义。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.2.2 哈希码"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "哈希函数的第一步是将键转化为哈希码，如果哈希码冲突，压缩函数也会冲突。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "三种哈希码：\n",
    "\n",
    "1. 将整数直接作为哈希码，或者简单处理之后作为哈希码——对字符串等对象不友好。\n",
    "2. 多项式哈希码：以字符串为例，每个字符都可以转化为数字，使用一个常数a，将每个字符作为系数，计算关于a的多项式的值作为哈希码。\n",
    "3. 循环位移哈希码：多项式哈希码在相加过程中用a赋予权重，循环位移哈希码则是在每次相加之前都将之前的和做一个位移。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "循环位移哈希码的实现："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def hash_code(s):\n",
    "    mask = (1 << 32) - 1\n",
    "    h = 0\n",
    "    for character in s:\n",
    "        h = (h << 5 & mask) | (h >> 27)\n",
    "        h += ord(character)\n",
    "    return h"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Python内置函数`hash`可以返回输入对象的哈希码，只有不可变的数据类型是可哈希的（确保在一个对象的生命周期内，哈希码保持不变）。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "元组的哈希码计算和字符串类似，都是单个元素的哈希码在某种方式上的结合。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "默认情况下，将hash函数直接作用于自定义的类是会报错的，但是可以通过定义类的`__hash__`方法来决定如何返回类的实例的哈希码，比如一个RGB颜色类的实例，可以定义`__hash__`为返回三个元素的元组的哈希码，这里就将类的实例的哈希码与元组的哈希码挂钩了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果定义了`__eq__`方法，那么`a == b`的情况下必须保证`hash(a) == hash(b)`，这由`__hash__`方法的定义来决定。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.2.3 压缩函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "压缩函数将哈希码映射到0到N-1的区间上（N是桶数组的长度），事实上0到N-1的每一个数字都代表着一个内存地址。一个好的压缩函数会使冲突降到最小。冲突有两个来源：一个是相同的哈希码，一个是不同的哈希码压缩成了相同的索引。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一个好的压缩函数可以确保不同的键映射为同一个索引的概率是1/N。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一个比较简单的方法是哈希码取N的余数，此时N最好取素数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一个比较好的方法是：\\[(ai + b) mod p\\] mod N，a、b范围是\\[0, p - 1\\]，p是比N更大的素数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.2.4 冲突处理方案"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "负载因子（load factor）：n/N，N为桶的个数，n为总共的元组的数量。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "冲突的出现有以下两种解决办法："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 分离链表（separate chaining）：桶数组的每个元素都是一个链表，链表可以存储多个元组。这种方法比较耗费空间。\n",
    "2. 开放寻址（open addressing）：如果哈希函数计算出的索引i已经被占用，则往后走（(i + 1) mod N），直到找到空桶（这里空桶不是真的空，而是用一个特殊的对象来表示，空桶会直接报错，需要避免），每个桶只存储一个元组，同样地，在删除时也需要从最初始的索引开始搜索（这叫线性探测），搜索到相应的k后进行删除，同样地，不是直接删除变成空桶。开放寻址的问题可能会产生聚集问题，桶填充超过一般后，搜索效率可能会很低。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "二次探测：(h(k) + $i^2$) mod N作为下一次探测的桶。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.2.5 负载因子、重新哈希和效率"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用separate chaining一般要求负载因子 < 0.9，使用开放寻址结合线性探测一般要求负载因子 < 0.5，其他探测方法，负载因子一般 < 2/3。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "重新哈希：插入新元组导致负载因子过大，这时会建立一张更大的哈希表，增大N以降低负载因子。在分析时间复杂度时，应该用类似摊销分析的方式来考虑重新哈希带来的时间复杂度。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "哈希表的各个方法时间复杂度：注意，使用separate chaining，各个桶中平均的元组数量为不小于n/N的最大整数，如果n是O(N)，那么数量就是O(1)，在这些数量中进行查找时间也是O(1)，最坏情况下，所有元组都在一个桶中，那么查找时间为O(n)。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "哈希表和列表各个方法的时间复杂度（这里的列表是键值对元组的列表，`__getitem__`也不是根据索引）："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "操作 | 列表 | 哈希表 | 哈希表最差情况\n",
    ":-: | :-: | :-: | :-:\n",
    "\\_\\_getitem\\_\\_ | O(n) | O(1) | O(n) \n",
    "\\_\\_setitem\\_\\_ | O(n) | O(1) | O(n)\n",
    "\\_\\_delitem\\_\\_ | O(n) | O(1) | O(n)\n",
    "\\_\\_len\\_\\_ | O(1) | O(1) | O(1)\n",
    "\\_\\_iter\\_\\_ | O(n) | O(n) | O(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "hash table是实现map最有效的方式之一，map需要通过key访问value，如果简单地用list来存储key和value的tuple，那么访问需要O(n)，如果使用hash，将key和内存地址直接挂钩（哈希码和压缩函数），那么访问将变得十分高效。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "dict是用hash table实现的map。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Python的命名空间是dict，如`c = a + b`，将通过`__getitem__`访问a和b的value，再通过`__setitem__`修改c的值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.2.6 Python哈希表的实现"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "哈希表是实现map的一种方式，分离链表和开放寻址是实现哈希表的两种方式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "map可以有多种实现方式，哈希表是其中之一，对于哈希表，不同的压缩函数可以是不同的实现，处理冲突的不同方法也可以是不同的实现，如separate chaining和开放寻址是哈希表的不同实现。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于内存地址是抽象的，所以可以通过数组的索引，将key映射到索引，间接上其实也映射到了内存地址。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在编写`hash table`的两种实现——分离链表和开放寻址之前，先在原来`MapBase`的基础上，编写`HashMapBase`类。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "from random import randrange\n",
    "\n",
    "class HashMapBase(MapBase):\n",
    "\n",
    "    def __init__(self, cap=11, p=109345121):\n",
    "        self._table = cap * [None]           ## list为桶数组，cap为初始长度\n",
    "        self._n = 0                          ## 存储的元组的数量\n",
    "        self._prime = p                      ## 用于压缩函数MAD\n",
    "        self._scale = 1 + randrange(p - 1)   ## MAD中的a\n",
    "        self._shift = randrange(p)           ## MAD中的b\n",
    "\n",
    "    def _hash_function(self, k):             ## 含转化为哈希码和压缩两个过程\n",
    "        return (hash(k) * self._scale + self._shift) % self._prime % len(self._table)\n",
    "\n",
    "    def __len__(self):\n",
    "        return self._n\n",
    "\n",
    "    def __getitem__(self, k):\n",
    "        j = self._hash_function(k)           ## 通过哈希函数计算索引\n",
    "        return self._bucket_getitem(j, k)    ## 分离链表和开放寻址在这个函数应该有所不同，在k和j条件下找到k对应的value\n",
    "\n",
    "    def __setitem__(self, k, v):\n",
    "        j = self._hash_function(k)\n",
    "        self._bucket_setitem(j, k, v)        ## 子类中具体实现\n",
    "        if self._n > len(self._table) // 2:  ## 动态\n",
    "            self._resize(2 * len(self._table) - 1)\n",
    "    \n",
    "    def __delitem__(self, k):\n",
    "        j = self._hash_function(k)\n",
    "        self._bucket_delitem(j, k)\n",
    "        self._n -= 1\n",
    "    \n",
    "    def _resize(self, c):\n",
    "        old = list(self.items())\n",
    "        self._table = c * [None]\n",
    "        self._n = 0\n",
    "        for (k, v) in old:\n",
    "            self[k] = v"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从key获得索引的过程，分离链表和开放寻址都是一样，区别在于之后如何进行`setitem`、`getitem`和`delitem`，这三个方法将在两个子类中具体实现。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ChainHashMap(HashMapBase):\n",
    "    \n",
    "    def _bucket_getitem(self, j, k):\n",
    "        bucket = self._table[j]\n",
    "        if bucket is None:\n",
    "            raise KeyError('Key Error: ' + repr(k))\n",
    "        return bucket[k]                              ## 如果找不到会报错\n",
    "    \n",
    "    def _bucket_setitem(self, j, k, v):\n",
    "        if self._table[j] is None:                    ## 如果为空，将新建一个UnsortedTableMap，本质就是存储_Item的list\n",
    "            self._table[j] = UnsortedTableMap()\n",
    "        oldsize = len(self._table[j])\n",
    "        self._table[j][k] = v                         ## 调用UnsortedTableMap的方法\n",
    "        if len(self._table[j]) > oldsize:             ## 判断k是老键还是新键，如果新键，存储元组的数量将+1\n",
    "            self._n += 1\n",
    "    \n",
    "    def _bucket_delitem(self, j, k):\n",
    "        bucket = self._table[j]\n",
    "        if bucket is None:\n",
    "            raise KeyError('Key Error: ' + repr(k))\n",
    "        del bucket[k]                                 ## 调用UnsortedTableMap的方法\n",
    "    \n",
    "    def __iter__(self):\n",
    "        for bucket in self._table:\n",
    "            if bucket is not None:\n",
    "                for k in bucket:                      ## 调用UnsortedTableMap的方法\n",
    "                    yield k"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用线性探测的开放寻址实现哈希表:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ProbeHashMap(HashMapBase):\n",
    "\n",
    "    _AVAIL = object()                                ## 哨兵\n",
    "\n",
    "    def _is_available(self, j):                      ## None是未被插入元组，_AVAIL是插入元组后删除\n",
    "        return self._table[j] is None or self._table[j] is ProbeHashMap._AVAIL\n",
    "\n",
    "    def _find_slot(self, j, k):\n",
    "        firstAvail = None\n",
    "        while True:\n",
    "            if self._is_available(j):                ## 未插入或者已插入被删除的索引\n",
    "                if firstAvail is None:               ## 保存第一个空索引，无论是未插入还是插入后删除\n",
    "                    firstAvail = j\n",
    "                if self._table[j] is None:           ## 如果出现None，说明找不到\n",
    "                    return (False, firstAvail)       ## 找不到返回第一个空的索引，用于插入\n",
    "            elif k == self._table[j]._key:           ## 找到了\n",
    "                return (True, j)\n",
    "            j = (j + 1) % len(self._table)\n",
    "\n",
    "    def _bucket_getitem(self, j, k):                 ## get、set和del都分为两种情况，找到和没找到\n",
    "        found, s = self._find_slot(j ,k)\n",
    "        if not found:                                ## 没找到报错\n",
    "            raise KeyError('Key Error: ' + repr(k))\n",
    "        return self._table[s]._value                 ## 找到返回\n",
    "    \n",
    "    def _bucket_setitem(self, j, k, v):\n",
    "        found, s = self._find_slot(j, k)      \n",
    "        if not found:                                ## 没找到插入\n",
    "            self._table[s] = self._Item(k, v)\n",
    "            self._n += 1\n",
    "        else:                                        ## 找到设置新的值\n",
    "            self._table[s]._value = v\n",
    "    \n",
    "    def _bucket_delitem(self, j, k):\n",
    "        found, s = self._find_slot(j, k)\n",
    "        if not found:                                ## 没找到报错\n",
    "            raise KeyError('Key Error: ' + repr(k))\n",
    "        self._table[s] = ProbeHashMap._AVAIL         ## 找到删除\n",
    "    \n",
    "    def __iter__(self):\n",
    "        for j in range(len(self._table)):\n",
    "            if not self._is_available(j):\n",
    "                yield self._table[j]._key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10.3 有序映射"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有序映射相比标准映射多出以下方法：\n",
    "\n",
    "1. M.find_min(): 返回最小`键`对应的键值对元组。\n",
    "2. M.find_max(): 同上。\n",
    "3. M.find_lt(k): 在严格小于k的键中，返回最大`键`对应的键值对元组。\n",
    "4. M.find_le(k): 同上，改为小于等于。\n",
    "5. M.find_gt(k): 在严格大于k的键中，返回最小`键`对应的键值对元组。\n",
    "6. M.find_ge(k): 同上，改为大于等于。\n",
    "7. M.find_range(start, stop): 左闭右开返回键在这个范围内键值对元组的迭代器，如果start为None，取最小键，如果stop为None，最终结果一直包含到最大键。\n",
    "8. iter(M): 所有键的迭代器（按键的大小排序）。\n",
    "9. reversed(M): 键的反序迭代器。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.3.1 排序检索表"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "排序检索表（sorted search table）: 将map的元组的存储在一个数组中，按照键的顺序排列。（map的另一种实现，跟哈希表无关，哈希表的特点在于键映射到索引）。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "排序检索表虽然不能跟哈希表一样，通过键直接映射到索引，但是根据键进行排序后，可以使用高效的二分查找。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在下面的二分查找中，最后如果high = low + 2，下一步必然有high = low；如果high = low + 1，下一步可能high < low，也可能high = low；分别考虑以上情况，如果从high = low + 1出发，下一步high < low，那么low处key > k，high + 1得到low，而low - 1之前肯定是已经判断过 <k 的，所以low是大于k的最小索引；如果low = high，首先肯定high + 1是 >k 的，low - 1是 <k 的，分情况继续考虑即可。剩下的情况是，k的值超出了整个序列键的取值范围，这个单独考虑。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 这里的一个重点是使用二分查找来精确查找和不精确查找（包括大于、大于等于的最小key，小于、小于等于的最大key）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "class SortedTableMap(MapBase):\n",
    "\n",
    "    def __init__(self):\n",
    "        self._table = []\n",
    "\n",
    "    def _find_index(self, k, low, high):                             ## 二分查找\n",
    "        if high < low:\n",
    "            return high + 1                                          ## 索引high的key与k大小未定，但是high + 1一定大于k\n",
    "        else:\n",
    "            mid = (low + high) // 2\n",
    "            if k == self._table[mid]._key:\n",
    "                return mid\n",
    "            elif k < self._table[mid].key:\n",
    "                return self._find_index(self, k, low, mid - 1)\n",
    "            else:\n",
    "                return self._find_index(self, k, mid + 1, high)\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self._table)\n",
    "\n",
    "    def __getitem__(self, k):\n",
    "        j = self._find_index(k, 0, len(self._table) - 1)\n",
    "        if j == len(self._table) or self._table[j]._key != k:       ## 查找失败\n",
    "            raise KeyError('Key Error: ' + repr(k))\n",
    "        return self._table[j]._value\n",
    "\n",
    "    def __setitem__(self, k, v):\n",
    "        j = self._find_index(k, 0, len(self._table) - 1)\n",
    "        if j < len(self._table) and self._table[j]._key == k:       ## 查找成功\n",
    "            self._table[j]._value = v\n",
    "        else:\n",
    "            self._table.insert(j, self._Item(k, v))\n",
    "\n",
    "    def __delitem__(self, k):\n",
    "        j = self._find_index(k, 0, len(self._table) - 1)\n",
    "        if j == len(self._table) or self._table[j]._key != k:       ## 查找失败\n",
    "            raise KeyError('Key Error: ' + repr(k))\n",
    "        del self._table[j]\n",
    "\n",
    "    def __iter__(self):\n",
    "        for item in self._table:\n",
    "            yield item._key\n",
    "\n",
    "    def __reversed__(self):\n",
    "        for item in reversed(self._table):\n",
    "            yield item._key\n",
    "\n",
    "    def find_min(self):\n",
    "        if len(self._table) > 0:\n",
    "            return self._table[0]._key, self._table[0]._value\n",
    "        else:\n",
    "            return None\n",
    "\n",
    "    def find_max(self):\n",
    "        if len(self._table) > 0:\n",
    "            return self._table[-1]._key, self._table[-1]._value\n",
    "\n",
    "    def find_ge(self, k):                                              ## 大于等于的最小\n",
    "        j = self._find_index(k, 0, len(self._table) - 1)\n",
    "        if j < len(self._table):\n",
    "            return self._table[j]._key, self._table[j]._value\n",
    "        else:\n",
    "            return None\n",
    "\n",
    "    def find_lt(self, k):\n",
    "        j = self._find_index(k, 0, len(self._table) - 1)\n",
    "        if j > 0:\n",
    "            return self._table[j - 1]._key, self._table[j - 1]._value\n",
    "        else:\n",
    "            return None\n",
    "\n",
    "    def find_gt(self, k):\n",
    "        j = self._find_index(k, 0, len(self._table) - 1)\n",
    "        if j < len(self._table) and self._table[j]._key == k:\n",
    "            j += 1\n",
    "        if j < len(self._table):\n",
    "            return self._table[j]._key, self._table[j]._value\n",
    "        else:\n",
    "            return None\n",
    "\n",
    "    def find_range(self, start, stop):\n",
    "        if start is None:\n",
    "            j = 0\n",
    "        else:\n",
    "            j = self._find_index(start, 0, len(self._table) - 1)\n",
    "        while j < len(self._table) and (stop is None or self._table[j]._key < stop):\n",
    "            yield self._table[j]._key, self._table[j]._value\n",
    "            j += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "时间复杂度：\n",
    "\n",
    "1. k in M，find_lt()等为O(log(n))，只需花费二分查找的时间。\n",
    "2. M\\[k\\]的删除和赋值，因为涉及到list长度的变化和其他元素位置的变动，所以最差情况下是O(n)。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有序映射用于相邻之间的键是有意义的场景，比如航班，出发点、时间、到达点等相近的航班可能都是用户考虑的对象。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "哈希表实现的标准映射在查找上是精确查找，而有序映射支持不那么精确的查找，可以看看周围的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10.4 跳跃表"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "跳跃表（skip list）可以用来实现排序映射，前面的排序检索表也是一种实现有序映射ADT的方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "二分查找是基于数组的，需要利用索引，带来的问题是更新操作最坏情况下需要O(n)的时间复杂度。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "跳跃表从一个有序的双向链表衍生而来，跳跃表的最底层是双向链表，左端增加负无穷的节点，右边增加正无穷的节点，然后每往上一层大致是前一层一半的节点，最顶层只有正无穷和负无穷两个节点。在一个有序的双向链表中进行查找，插入和删除操作都需要O(n)，如果在跳跃表中进行，则都是O(log(n))，因为在查找时（插入和删除也基于查找），跳跃表的上层可以像二分查找一样一层一层地框定范围，省去很多不必要的搜索时间，这种结构相比有序数组的二分查找来说，好在是链表实现的，在插入和删除时不会因为其他元素的移动而浪费大量时间。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "跳跃表的查找、插入和删除操作的时间复杂度分析都是分析数学期望的，因为其中有抛硬币这样的随机时间存在，这部分较为复杂，先跳过，掌握跳跃表的结构、节点（封装成位置）的结构、查找、插入和删除的算法即可。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.4.1 跳跃表中的查找和更新操作"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10.5 集合、多集和多映射"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 集合（set）：无序、不重复的元素聚集在一起。\n",
    "2. 多集（multiset）：允许有重复元素的set-like容器。\n",
    "3. 多映射（multimap）：一个键可以对应多个值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Python中的集合：set和frozenset，用`哈希表`实现，collections模块定义了集合抽象基类——Mutableset。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "集合支持的方法：\n",
    "\n",
    "1. S.add(e): 如果集合中没有e，则增加e。（哈希表不一定要键值对元组，专门用于查找的哈希表，不需要值，一般的哈希表从键搜索值，专门用于查找的哈希表，只需要确定键是否存在即可）\n",
    "2. S.discard(e): 删除e，没有e则无效。（哈希表支持删除）\n",
    "3. e in S: 集合S是否包含e。（哈希表支持查找`__contains__`）\n",
    "4. len(S): 集合中元素个数。\n",
    "5. iter(S): 生成集合中所有元素的迭代器。\n",
    "\n",
    "一些其他的方法：\n",
    "\n",
    "1. S.remove(e): 删除e，没有e则报错。\n",
    "2. S.pop(): 删除集合中的任意一个元素并返回，集合为空集则报错。\n",
    "3. S.clear(): 删除集合中所有元素。\n",
    "\n",
    "集合间的方法:\n",
    "\n",
    "1. S == T: 所有元素是否相同。\n",
    "2. S != T: 与上面的互补。\n",
    "3. S <= T: S是T的子集。\n",
    "4. S < T: S是T的真子集。\n",
    "5. S >= T\n",
    "6. S > T\n",
    "7. S.isdisjoint(T): S与T是否有公共元素。\n",
    "8. S | T: 返回新集合，为S和T的并集。（\\_\\_or\\_\\_）\n",
    "9. S | =T: S更新为S和T的并集。（\\_\\_ior\\_\\_）\n",
    "10. S & T: 交集。\n",
    "11. S &= T: 同理。\n",
    "12. S ^ T: 并集去除交集的部分。\n",
    "13. S ^= T: 同理。\n",
    "14. S - T: S去除交集的部分。\n",
    "15. S -= T: 同理。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "collections类中MutableSet为五个基本方法之外的所有方法都提供了具体实现（基于五个基本方法），五个基本方法留给具体的子类实现。（template method pattern）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "以五个基本方法之外一个方法为例，`__lt__`判断子集：使用了三种基本的方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "def __lt__(self, other):\n",
    "    if len(self) >= len(other):          ## len方法\n",
    "        return False\n",
    "    for e in self:                       ## iter方法\n",
    "        if e not in other:               ## in方法\n",
    "            return False\n",
    "    return True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`S |= T`不通过`S = S | T`来实现会有更高的效率。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`S | T`的实现如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "def __or__(self, other):\n",
    "    result = type(self)()                ## 新建一个集合，具体子类名字不知道\n",
    "    for e in self:\n",
    "        result.add(e)\n",
    "    for e in other:\n",
    "        result.add(e)\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "需要的时间是O(m + n)，m和n分别是S和T的大小。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "def __ior__(self, other):\n",
    "    for e in other:\n",
    "        self.add(e)\n",
    "    return self"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "时间复杂度为O(n)，n为other的大小。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 10.5.3 集合、多集和多映射的实现"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "集合是一个简单的映射（只有键没有值），所以实际上用于查找的最快方法是用哈希表实现集合用于查找。任何一个map的实现，都可以将value设置为None以变成集合，但是会浪费空间，更好的办法是从一开始就用集合的元素来代替`_Item`类。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "多集的一种实现方法是使用一个map，键为不同的元素，值为元素在多集中的个数。另一种是相同元素在集合中赋予不同的存在。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "多映射的一个实现是标准映射下键对应的值用一个list来存储多个值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注：虽然前面有很多map的不同实现方法，哈希表、排序检索表和跳跃表等，但是他们所支持的方法是通过map的ADT决定的，他们都支持M\\[k\\]、M\\[k\\] = v等相同的调用方式，他们本质上所存储的东西和对应的数据是一样的，不一样的是空间复杂度和时间复杂度。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "相同的ADT的不同实现往往`继承`至相同的抽象基类，适配器用于不同的ADT之间的互通。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MultiMap:\n",
    "\n",
    "    _MapType = dict()                           ## 任意一个映射，都支持相同的ADT方法\n",
    "\n",
    "    def __init__(self):\n",
    "        self._map = self._MapType\n",
    "        self._n = 0\n",
    "\n",
    "    def __iter__(self):\n",
    "        for k, secondary in self._map.items():\n",
    "            for v in secondary:\n",
    "                yield k, v\n",
    "\n",
    "    def add(self, k, v):\n",
    "        container = self._map.setdefault(k, [])\n",
    "        container.append(v)\n",
    "        self._n += 1\n",
    "\n",
    "    def pop(self, k):                           ## 只删除一个\n",
    "        secondary = self._map[k]                ## 可能报错\n",
    "        v = secondary.pop()\n",
    "        if len(secondary) == 0:\n",
    "            del self._map[k]\n",
    "        self._n -= 1\n",
    "        return k, v\n",
    "\n",
    "    def find(self, k):\n",
    "        secondary = self._map[k]\n",
    "        return k, secondary[0]\n",
    "\n",
    "    def find_all(self, k):\n",
    "        secondary = self._map.get(k, [])        ## 不会报错\n",
    "        for v in secondary:\n",
    "            yield k, v"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
